A Comparison of Manual and Automatic Constructions of Category Hierarchy for Classifying Large Corpora

Authors

  • Fumiyo Fukumoto
  • Yoshimi Suzuki
Abstract

We address the problem of handling a large collection of data, and investigate automatically constructing a category hierarchy from a given set of categories to improve the classification of large corpora. We use two well-known techniques, a partitioning clustering method (k-means) and a loss function, to create the category hierarchy. k-means clusters the given categories into a hierarchy. To select the proper number of clusters k, we use a loss function, which measures the degree of our disappointment in any differences between the true distribution over inputs and the learner's prediction. Once the optimal k is selected, the procedure is repeated for each cluster. Our evaluation on the 1996 Reuters corpus, which consists of 806,791 documents, shows that automatically constructing the hierarchy improves classification accuracy.
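The recursive procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The abstract does not specify the loss function or how categories are represented, so `stand_in_loss` (within-cluster squared error plus a per-cluster penalty) and the plain numeric category vectors are assumptions made only for this sketch, and every function name here is hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(vectors, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns non-empty clusters as lists of indices."""
    rng = random.Random(seed)
    centers = [list(vectors[i]) for i in rng.sample(range(len(vectors)), k)]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for i, v in enumerate(vectors):
            nearest = min(range(k), key=lambda c: sq_dist(v, centers[c]))
            clusters[nearest].append(i)
        # Keep the old center if a cluster ended up empty.
        updated = [centroid([vectors[i] for i in c]) if c else centers[j]
                   for j, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return [c for c in clusters if c]


def stand_in_loss(clusters, vectors, penalty=1.0):
    """ASSUMPTION: a placeholder for the paper's loss function.

    Within-cluster squared error plus a penalty per cluster, so that
    adding clusters is not always rewarded.
    """
    total = 0.0
    for c in clusters:
        center = centroid([vectors[i] for i in c])
        total += sum(sq_dist(vectors[i], center) for i in c)
    return total + penalty * len(clusters)


def build_hierarchy(names, vectors, max_k=5, min_size=2):
    """Pick k by minimum loss, split with k-means, then recurse per cluster."""
    if len(names) <= min_size:
        return list(names)
    best = None
    for k in range(2, min(max_k, len(names) - 1) + 1):
        clusters = kmeans(vectors, k)
        loss = stand_in_loss(clusters, vectors)
        if best is None or loss < best[0]:
            best = (loss, clusters)
    if best is None or len(best[1]) < 2:
        return list(names)  # no useful split found; treat as a leaf
    return [
        build_hierarchy([names[i] for i in c], [vectors[i] for i in c],
                        max_k, min_size)
        for c in best[1]
    ]


if __name__ == "__main__":
    # Hypothetical toy categories with 2-D feature vectors.
    names = ["a", "b", "c", "d"]
    vectors = [(0, 0), (0, 1), (10, 10), (10, 11)]
    print(build_hierarchy(names, vectors))
```

The result is a nested list in which each inner list is a subtree of categories. In a setting closer to the abstract, the loss would presumably be estimated from classifier predictions on held-out documents rather than from cluster geometry.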


Similar resources

Sensitometric characteristics of D-, E- and F-speed dental radiographic films in manual and automatic processing

BACKGROUND: The purpose of this study was to evaluate the sensitometric characteristics of Ultraspeed, Ektaspeed Plus and Insight dental radiographic films using manual and automatic processing systems. METHODS: In this experimental in vitro study, an aluminum step-wedge was used to construct characteristic curves for D-, E- and F-speed radiographic films (Kodak Eastman, Rochester, USA). All fil...


A Comparison of Website Accessibility Evaluation Methods (Case Study: Websites of the Ministries of the Government of the Islamic Republic of Iran)

Purpose: The present research aims to compare different methods for evaluating the accessibility of websites and to analyze the results of a case study of the websites of Iranian government ministries, in order to indicate the strengths, weaknesses, and differences in the evaluation findings produced by each website accessibility method. Methodology: In this paper, initially the ...


The Impact of Different Frequency Patterns on the Syntactic Production of a 6-year-old EFL Home Learner: A Case Study

This longitudinal study investigated the impact of different Frequency Patterns (FP) on the syntactic production of a six-year-old EFL learner in a home context. Target syntactic constructions were presented using games and plays and were traced for their occurrence patterns in input and output. Following each instruction period, the constructions were measured through immediate and delayed ora...


Hedges in English for Academic Purposes: A Corpus-based study of Iranian EFL learners

Hedges, as tools to express tentativeness and doubt, have been studied in plenty of research papers in the Iranian EFL research setting. However, their use in a learner corpus, portraying Iranian learner English, is in need of more research attention. With this end in view, this study aimed at investigating how Iranian EFL learners who have majored in English-related fields in Iran deployed hed...


An Automatic Fingerprint Classification Algorithm

Manual fingerprint classification algorithms are very time consuming, and usually not accurate. Fast and accurate fingerprint classification is essential to each AFIS (Automatic Fingerprint Identification System). This paper investigates a fingerprint classification algorithm that reduces the complexity and costs associated with the fingerprint identification procedure. A new structural algorit...




Publication date: 2004